
    Adding hygiene to Gambit Scheme

    The Scheme programming language is known for its powerful macro system. With Scheme source code represented as actual Scheme data, macro transformations allow the programmer to act directly on the underlying abstract syntax tree. Macro transformations use a syntax similar to that of regular procedures, but they define procedures meant to be executed at compile time. Those procedures return an abstract syntax tree representation to be substituted at the transformer's call location. Procedures executed at compile time have the same power as those executed at run time. With this macro system, the programmer can create specialized syntax rules without additional performance cost; it also allows code abstraction without the usual run-time cost of closure creation. Scheme inherits this representation of source code as values from the Lisp programming language. Source code is represented as lists of symbols, or lists containing other lists: a structure called an S-expression. However, with this simplistic approach, accidental name clashes can occur. The binding to which an identifier refers is determined by the lexical context of that identifier. When an identifier is moved around in the abstract syntax tree, it can be captured by the lexical context of another binding with the same name, so the moved identifier may no longer refer to the expected binding: referential transparency is lost, the choice of names directly influences program behavior, and the resulting errors are hard to diagnose. Accidental name clashes can be fixed manually in the code, for instance by using unique identifier names, but the programmer then becomes responsible for the program's safety. The automatic preservation of referential transparency is called hygiene, a notion that has been thoroughly studied in the context of the lambda calculus and Lisp-like languages. The latest Scheme revised report, used as the specification for the language, extends the language with hygienic macro transformations. Until now, the Gambit Scheme implementation did not provide a built-in hygienic macro system. As a contribution, we re-implemented Gambit's macro system to support hygienic transformations at its core. The algorithm we chose is based on the set-of-scopes algorithm created by Matthew Flatt and implemented in the Racket language. Racket is heavily based on Scheme but diverges on several core features; one key aspect of Racket is its extensive hygienic macro system, on which most of its core features are built, so the algorithm has been robustly tested in that context. In this thesis, we give an overview of the Scheme language and its syntax, state the hygiene problem, and describe different strategies used to enforce hygiene automatically. We then justify and formalize our algorithmic choice. Finally, we present the original Gambit macro system, explain the changes it required, and provide a validity and performance analysis comparing the original Gambit implementation to our new hygienic version.
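    To make the capture problem concrete, the sketch below is a toy, hypothetical macro expander written in Python, with S-expressions modelled as nested lists. It contrasts an unhygienic expansion of a swap! macro with the classic gensym-style renaming fix; the set-of-scopes algorithm adopted in the thesis is a more general strategy (it attaches sets of scope marks to identifiers rather than renaming them eagerly), so this illustrates the problem, not Gambit's actual implementation.

        import itertools

        _fresh_counter = itertools.count()

        def fresh(name):
            """Return a unique identifier derived from name (gensym-style)."""
            return f"{name}%{next(_fresh_counter)}"

        # A toy 'swap!' macro over S-expressions modelled as nested lists.
        # Unhygienic expansion: the introduced 'tmp' can capture a user's own 'tmp'.
        def expand_swap_unhygienic(a, b):
            return ["let", [["tmp", a]], ["set!", a, b], ["set!", b, "tmp"]]

        # Hygienic expansion: macro-introduced identifiers get fresh names, so an
        # argument literally named 'tmp' can no longer collide with the macro's own.
        def expand_swap_hygienic(a, b):
            t = fresh("tmp")
            return ["let", [[t, a]], ["set!", a, b], ["set!", b, t]]

        print(expand_swap_unhygienic("tmp", "x"))  # the user's 'tmp' is captured
        print(expand_swap_hygienic("tmp", "x"))    # a fresh name avoids the clash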

    Advanced Document Description, a Sequential Approach

    To be able to perform efficient document processing, information systems need simple document models that can be processed in fewer operations. This problem of document representation is not trivial. For decades, researchers have tried to combine relevant document representations with efficient processing. Documents are commonly represented by vectors in which each dimension corresponds to a word of the document. This approach is termed “bag of words”, as it entirely ignores the relative positions of words. One natural improvement over this representation is the extraction and use of cohesive word sequences. In this dissertation, we consider the problem of the extraction, selection and exploitation of word sequences, with a particular focus on the applicability of our work to domain-independent document collections written in any language.
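    As a toy illustration of the gap between the two representations discussed above, the hypothetical Python sketch below builds a bag-of-words vector and contiguous word n-grams (one simple notion of a cohesive word sequence); the dissertation's actual extraction and selection methods are more elaborate.

        from collections import Counter

        def bag_of_words(text):
            """Bag-of-words model: word -> count, ignoring word order entirely."""
            return Counter(text.lower().split())

        def word_sequences(text, n=2):
            """Contiguous word n-grams, one simple notion of a cohesive sequence."""
            words = text.lower().split()
            return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

        doc = "new york is not the same as york"
        print(bag_of_words(doc))       # 'new' and 'york' are counted independently
        print(word_sequences(doc, 2))  # the bigram ('new', 'york') is preserved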

    ICDAR 2019 Competition on Post-OCR Text Correction

    This paper describes the second round of the ICDAR 2019 competition on post-OCR text correction and presents the different methods submitted by the participants. OCR has been an active research field for over 30 years, but results are still imperfect, especially for historical documents. The purpose of this competition is to compare and evaluate automatic approaches for correcting (denoising) OCR-ed texts. The challenge consists of two tasks: 1) error detection and 2) error correction. An original dataset of 22M OCR-ed symbols along with an aligned ground truth was provided to the participants, with 80% of the dataset dedicated to training and 20% to evaluation. Different sources were aggregated and contain newspapers, historical printed documents as well as manuscripts and shopping receipts, covering 10 European languages (Bulgarian, Czech, Dutch, English, Finnish, French, German, Polish, Spanish and Slovak). Five teams submitted results; the error detection scores vary from 41% to 95%, and the best error correction improvement is 44%. This competition, which counted 34 registrations, illustrates the strong interest of the community in improving OCR output, a key issue for any digitization process involving textual data.
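    The competition's official metrics are not detailed in this abstract; as one common way to quantify the improvement a correction system brings, the hypothetical sketch below measures the relative reduction in edit distance to the ground truth.

        def edit_distance(a, b):
            """Levenshtein distance via the classic dynamic-programming recurrence."""
            prev = list(range(len(b) + 1))
            for i, ca in enumerate(a, 1):
                cur = [i]
                for j, cb in enumerate(b, 1):
                    cur.append(min(prev[j] + 1,                 # deletion
                                   cur[j - 1] + 1,              # insertion
                                   prev[j - 1] + (ca != cb)))   # substitution
                prev = cur
            return prev[-1]

        def improvement(ocr, corrected, truth):
            """Relative reduction of the edit distance to the ground truth."""
            before = edit_distance(ocr, truth)
            after = edit_distance(corrected, truth)
            return (before - after) / before if before else 0.0

        print(improvement("Tbe qnick fox", "The quick fox", "The quick fox"))  # 1.0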

    Receipt Dataset for Fraud Detection

    The aim of this paper is to introduce a new dataset initially created to work on fraud detection in documents. The dataset is composed of 1969 images of receipts and the associated OCR result for each. The article details the dataset and its interest for the document analysis community. We share this dataset with the community as a benchmark for the evaluation of fraud detection approaches.

    Dataset for Temporal Analysis of English-French Cognates

    Languages change over time and, thanks to the abundance of digital corpora, their evolutionary analysis using computational techniques has recently gained much research attention. In this paper, we focus on creating a dataset to support the investigation of similarities in evolution between different languages. We look in particular into the similarities and differences in the use of corresponding words across time in English and French, two languages from different linguistic families yet with shared syntax and close contact. For this we select a set of cognates in both languages and study their frequency changes and correlations over time. We propose a new dataset for computational approaches to synchronized diachronic investigation of language pairs, and subsequently show novel findings stemming from the cognate-focused diachronic comparison of the two chosen languages. To the best of our knowledge, the present study is the first in the literature to use computational approaches and large data to make a cross-language diachronic analysis.
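    As an illustration of the kind of analysis such a dataset enables, the sketch below computes a Pearson correlation between per-decade frequency series of a cognate pair; the frequency values are invented for the example and do not come from the paper.

        def pearson(x, y):
            """Pearson correlation between two equal-length series."""
            n = len(x)
            mx, my = sum(x) / n, sum(y) / n
            cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
            vx = sum((a - mx) ** 2 for a in x) ** 0.5
            vy = sum((b - my) ** 2 for b in y) ** 0.5
            return cov / (vx * vy)

        # Hypothetical per-decade relative frequencies of the cognate "table".
        freq_en = [0.80, 0.70, 0.60, 0.50, 0.40]   # English corpus
        freq_fr = [0.90, 0.80, 0.70, 0.55, 0.45]   # French corpus
        print(pearson(freq_en, freq_fr))           # close to 1: parallel evolution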

    Automatic Matching and Expansion of Abbreviated Phrases without Context

    In many documents, like receipts or invoices, textual information is constrained by the space and organization of the document. The document information has no natural-language context, and expressions are often abbreviated to respect the graphical layout, both at the word level and at the phrase level. In order to analyze the semantic content of these types of documents, we need to understand each phrase, and particularly each name of a sold product. In this paper, we propose an approach to find the right expansion of abbreviations and acronyms without context. First, we extract information about sold products from our receipts corpus and analyze the different linguistic processes of abbreviation. Then, we retrieve a list of expanded names of products sold by the company that issued the receipts, and we propose an algorithm to pair extracted product names with the corresponding expansions. We provide the research community with a unique document collection for abbreviation expansion.
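    The paper's exact pairing algorithm is not given in this abstract; the hypothetical sketch below shows one simple baseline for the task, matching each abbreviated token to the corresponding expansion token when its characters appear in order (an in-order subsequence test).

        def is_subsequence(abbrev, word):
            """True if abbrev's characters appear in order inside word."""
            it = iter(word)
            return all(c in it for c in abbrev)

        def match_phrase(abbrev_phrase, expansion):
            """Token-by-token check that each abbreviated token is an
            in-order subsequence of the corresponding expansion token."""
            a_tokens = abbrev_phrase.lower().split()
            e_tokens = expansion.lower().split()
            return len(a_tokens) == len(e_tokens) and all(
                is_subsequence(a, e) for a, e in zip(a_tokens, e_tokens))

        def expand(abbrev_phrase, catalog):
            """Return catalog entries compatible with the abbreviated phrase."""
            return [e for e in catalog if match_phrase(abbrev_phrase, e)]

        catalog = ["chocolate croissant", "chicken breast", "cherry tomatoes"]
        print(expand("choc crsnt", catalog))  # ['chocolate croissant']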

    Automatic Discovery of Word Semantic Relations

    In this paper, we propose an unsupervised methodology to automatically discover pairs of semantically related words by highlighting their local environment and evaluating their semantic similarity in local and global semantic spaces. This proposal differs from previous research as it tries to take the best of two different methodologies, i.e. semantic space models and information extraction models. It can be applied to extract close semantic relations, it limits the search space, and it is unsupervised.
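    The abstract does not spell out the similarity measure; the sketch below shows the usual primitive for comparing words in a semantic space, cosine similarity over co-occurrence vectors, with toy vectors invented for the example.

        import math

        def cosine(u, v):
            """Cosine similarity between two dense vectors."""
            dot = sum(a * b for a, b in zip(u, v))
            nu = math.sqrt(sum(a * a for a in u))
            nv = math.sqrt(sum(b * b for b in v))
            return dot / (nu * nv) if nu and nv else 0.0

        # Toy co-occurrence counts over context words ("drive", "wheel", "bark").
        car = [10.0, 8.0, 0.0]
        truck = [9.0, 7.0, 1.0]
        dog = [0.0, 1.0, 9.0]
        print(cosine(car, truck))  # high: semantically related words
        print(cosine(car, dog))    # low: unrelated words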

    Extended Overview of HIPE-2022: Named Entity Recognition and Linking in Multilingual Historical Documents

    This paper presents an overview of the second edition of HIPE (Identifying Historical People, Places and other Entities), a shared task on named entity recognition and linking in multilingual historical documents. Following the success of the first CLEF-HIPE-2020 evaluation lab, HIPE-2022 confronts systems with the challenges of dealing with more languages, learning domain-specific entities, and adapting to diverse annotation tag sets. This shared task is part of the ongoing efforts of the natural language processing and digital humanities communities to adapt and develop appropriate technologies to efficiently retrieve and explore information from historical texts. On such material, however, named entity processing techniques face the challenges of domain heterogeneity, input noisiness, dynamics of language, and lack of resources. In this context, the main objective of HIPE-2022, run as an evaluation lab of the CLEF 2022 conference, is to gain new insights into the transferability of named entity processing approaches across languages, time periods, document types, and annotation tag sets. Tasks, corpora, and results of participating teams are presented. Compared to the condensed overview [1], this paper contains more refined statistics on the datasets, a breakdown of the results per entity type, and a discussion of the ‘challenges’ proposed in the shared task.
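    As a small illustration of the tag-set adaptation challenge mentioned above, the hypothetical sketch below maps fine-grained IOB entity tags to a coarse tag set; the tag names are simplified examples, not the exact HIPE-2022 guidelines.

        # Hypothetical mapping from a fine-grained tag set to a coarse one,
        # illustrating the tag-set adaptation problem the lab evaluates.
        FINE_TO_COARSE = {
            "pers.ind": "PER", "pers.coll": "PER",
            "loc.adm.town": "LOC", "loc.phys.geo": "LOC",
            "org.adm": "ORG", "org.ent": "ORG",
        }

        def coarsen(iob_tag):
            """Map a fine-grained IOB tag (e.g. 'B-pers.ind') to its coarse form."""
            if iob_tag == "O":
                return "O"
            prefix, _, fine = iob_tag.partition("-")
            return f"{prefix}-{FINE_TO_COARSE.get(fine, 'OTH')}"

        tags = ["B-pers.ind", "I-pers.ind", "O", "B-loc.adm.town"]
        print([coarsen(t) for t in tags])  # ['B-PER', 'I-PER', 'O', 'B-LOC']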

    Find it! Fraud Detection Contest Report

    This paper describes the ICPR 2018 fraud detection contest, its dataset, evaluation methodology, and the different methods submitted by the participants to tackle the predefined tasks. Forensics research is quite a sensitive topic. Data are either private or unlabeled, and most related work is evaluated on private datasets with restricted access. This restriction has two major consequences: results cannot be reproduced, and no benchmarking can be done across approaches. This contest was conceived to address these drawbacks. Two tasks were proposed: detecting documents containing at least one forgery in a flow of documents, and spotting and localizing these forgeries within documents. An original dataset composed of images and texts of French receipts was provided to participants. The results they obtained are presented and discussed.